Prosper Loan Data

knitr::opts_chunk$set(fig.width=15, fig.height=10) 

Loan Data from Prosper

Prosper Marketplace, Inc. is a San Francisco, California-based company in the peer-to-peer lending industry.In this project, I use their dataset to explore their data which contains 113,937 loans with 81 variables.

Since the dataset is too large and has around 81 variables,I’m subsetting my dataset to consider only those variables which I think will give me a good detailed explanation of my borrowers!

## 'data.frame':    113937 obs. of  24 variables:
##  $ ListingNumber            : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ CreditGrade              : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ ListingCreationDate      : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ Term                     : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus               : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate               : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerRate             : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ CreditScoreRangeLower    : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ ProsperRating..Alpha.    : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ CreditScoreRangeUpper    : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ LenderYield              : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ DebtToIncomeRatio        : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ ProsperScore             : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.: int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState            : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation               : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ IncomeRange              : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ TotalProsperLoans        : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ CurrentDelinquencies     : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ DelinquenciesLast7Years  : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ AmountDelinquent         : num  472 0 NA 10056 0 ...
##  $ LoanOriginalAmount       : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationQuarter   : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ LoanOriginationDate      : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...

Cleaning the data variables

After eye-balling the dataset, I saw that there were some variables which required reverse coding, numeric-to-factor conversion and other slight changes.

  1. Combining the Creditscore lower and upper range to get a single values for the Credit Score given by the Consumer Credit rating agency. I decided to find the mean of these two columns and assign it as the Credit Score of respective listing.
## Warning: package 'bindrcpp' was built under R version 3.3.3
## 
##     Defaulted     Completed       Current Late Payments 
##         17010         38279         56576          2067
  1. The listing category variable is of type numeric. COnverting this into a category based on the numeric coding given in the excel :

0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans

## 
## Debt Consolidation   Home Improvement           Business 
##              58308               7433               7189 
##      Personal Loan        Student Use               Auto 
##               2395                756               2572 
##    Baby & Adoption               Boat Cosmetic Procedure 
##                199                 85                 91 
##    Engagement Ring        Green Loans Household Expenses 
##                217                 59               1996 
##    Large Purchases     Medical/Dental         Motorcycle 
##                876               1522                304 
##                 RV              Taxes           Vacation 
##                 52                885                768 
##            Wedding              Other     Not Applicable 
##                771              10494              16965
  1. Lets extract the year from the Loan Origination Date and naming it Origination Year.

4.From the variable dictionary, the two columns- CreditGrade & ProsperRating (Alpha) give us thhe credit rating for loans pre-2009 and post 2009 respectively.

Joining these two columns into a single column called CreditRating.

Univariate Plots Section

Univariate Plots

1. Loan Origination Amount

Since our dataset is about the loans, lets see how much loan amount is usually requested.

Plotting the histogram for LoanOriginationAmount.

From the above plot we see that the loan amounts 4000, 10000, 15000 and 20000 have count value more than 10000. This means that the bulk of loaners from Proper usually borrow smaller amounts. Lets, see why? Is it cause of the high Interest rate or based on the borrower characteristics?

2. Term

Lets see how long are these loans usually taken for. It’s expressed in months in the dataset, converting months to year by dividing the term value by 12.

From the above plot, we can clearly say that most of the loans are lent on 36 months (= 3 years) or 60 months (= 5 years) basis. Very few are taken on 12 month basis

3 . Loan Status

Lets see what’s the status of most loans?

So we see the certain loans have been defaulted and there are a lot of bands which show that they are quite some late payments.

## [1] 4.404188
## [1] 1.81416

The exact percentage of cases that are defaulted is 4.401% and number of late payments are 1.81%

4 . Listing Category

Let’s explore what is the major category under which Prosper loans are issued

From above plot, its clear that most of the loans are issued under Debt Consolidation and the rest ambigious ones (Not Applicable) comes next but is much below the 20,000 count mark.

Borrower’s Characteristics

Let’s explore the data to understand more about our borrowers.

1. Borrower’s State

Looks like most of the borrowers are from Arizona State!

2. Rate at which the borrowers are given the loan:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Looking at the above plot, it looks like most of the loans are borrowed at the rate of 0.15 % -0.17% The next largest bin is 0.1% followed by 0.2% And looking at the whole graph, it looks normally distributed!

3. Income Range of the borrowers:

Looks like most of the borrowers belong to low to mid income range. The largest bin has users of income range $25K-50K followed by $50k-75k

4. Credit Score:

## Warning: Removed 5584 rows containing non-finite values (stat_bin).

So from the above plot its clear that most of the borrowers have a credit score of 650-750.

Univariate Analysis

After creating new variables from the existing dataset, my final dataset of 26 variables consists of 8 factor variables and rest integer variables.I have chosen the following variables to explore the characteristics of Prosper borrowers and the company as such. The most interesting plot which i found from my univariate analysis, is the time series plot where we saw how many loans were issued per quarter since 2006 to 2014-Q1. The second most interesting factor about the borrowers of Prosper is that the companies borrowing are all mid-level companies issueing loans at the rate of 0.15-0.17% mainly for Debt Consolidation purposes. Another interesting characteristics of these creditors is that they all have an average credit score between 650-750 and are from the state of Arizona.

Bivariate Plots Section

Lets explore and see if there is any relation between the Credit Rating and Credit Score. Just to re-iterate, Credit Rating is a variable which i created by clubbing the columns - Credit Grade and ProperRating..Alpha

## Warning: Removed 852 rows containing non-finite values (stat_boxplot).

The Credit Rating assigned at the time of the listing was created: 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. So, looking at the plot, its clear that as the rating given my prosper is directly propostional to the credit score of the user. But though rating ‘AA’ has higher rating ‘A’ but the median score of AA looks lower than A. Let’s see what summary statistics tell us about this:

by(loan$CreditScore, loan$CreditRating, summary)
## loan$CreditRating: AA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   649.5   709.5   729.5   739.2   769.5   889.5      73 
## -------------------------------------------------------- 
## loan$CreditRating: A
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   689.5   769.5   789.5   787.7   809.5   889.5      92 
## -------------------------------------------------------- 
## loan$CreditRating: B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   609.5   689.5   709.5   712.4   729.5   869.5      77 
## -------------------------------------------------------- 
## loan$CreditRating: C
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   609.5   669.5   689.5   689.7   709.5   889.5      98 
## -------------------------------------------------------- 
## loan$CreditRating: D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   609.5   629.5   669.5   671.3   709.5   869.5      78 
## -------------------------------------------------------- 
## loan$CreditRating: E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   549.5   609.5   649.5   647.3   689.5   869.5      71 
## -------------------------------------------------------- 
## loan$CreditRating: HR
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   529.5   669.5   632.4   689.5   869.5      91

From the summary statistics, we see that the median value for rating ‘AA’ is lesser than ‘A’ by 50 point and even the Mean value. Reading more about how Prosper’s properitiary system, it looks like they grade loaners AA not just by their credit history but also internal scoring system called Prosper Score and only those are given AA if they fall under low risk category and get lower interest rate (5.48%)

Now exploring Amount borrowed over time by Term

Though the plot isn’t very clear, but it looks like most of the large amount of money has been borrowed for long term usually for 60 months which is a 5 year period.

Lets plot Listing category and the loan Amount

Looks like highest loan amounts are borrowed under ‘Debt Consolidation’ and ‘Baby&Adoption’ category.Lets look at what rate at which these loans are issued

ggplot(aes(x=ListingCategory,y=BorrowerRate), data = loan) +
  geom_boxplot() +
  ggtitle('Rate of Interest for various Categories') +
  xlab('Listing Categories') +
  ylab('Borrower Rate')+theme(axis.text.x = element_text(angle = 45, hjust = 1))

Though the amount borrowed under Debt Consolidation and Baby Adoption is high but the borrower rate is pretty low. The highest rate is for Cosmetic Procedures and Household expenses.

Looking at Income range vs. Occupation

Looks like Nurses have taken a good share of loans and they seem to fall under the less risky category. But at the same time, they seem to hold good share every credit rating grade.

Now, lets see how the loan status varies w.r.t Loan Origination Amount

ggplot(aes(x=loan$newStatus, y=loan$LoanOriginalAmount,fill=loan$newStatus), data = loan )+geom_boxplot(lwd = 0.75, outlier.color = I("#FBB448"), outlier.shape = 1) + scale_x_discrete(limits = c("Defaulted", "Completed", "Current", "Late Payments")) 
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).

# Bivariate Analysis

After looking at all the plots in the Bivariate analysis, it looks like the Nurses are more in number of loans issued. They seem equally distributed in all the ratings provided by Prosper.This relates to why the lower income range [25000-49000] have issued the most number of loans. The most amount of loans where taken under the category of Debt COnsolidation and Baby & Adoption but the rate of interest was highest for the categories Household Expenses and Cosmetic Procedures.

The last plot where I explore the loan amount with the loan status. I derived the column column newStatus from the original column Loan Status, where I grouped statuses - Charged Off and Defaulted into Defaulted, Final Payment in Progress into Completed, all the late payments with various day deltas into Late Payments category.I removed the cancelled category and converetd them into NA.

The total dstribution for this new column is as follows

## 
##     Defaulted     Completed       Current Late Payments 
##         17010         38279         56576          2067

Multivariate Plots Section

Correlation Matrix

## Warning in ggcorr(loan, label = TRUE, label_size = 3, hjust = 0.8,
## size = 2.5, : data in column(s) 'CreditGrade', 'ListingCreationDate',
## 'LoanStatus', 'ClosedDate', 'ProsperRating..Alpha.', 'BorrowerState',
## 'Occupation', 'IncomeRange', 'LoanOriginationQuarter',
## 'LoanOriginationDate', 'newStatus', 'ListingCategory', 'CreditRating',
## 'Term.f' are not numeric and were ignored

The correlation matrix revealed some interesting facts. The variables with correlation of 1 between variables Lender Yield and Borrower Rate, Creation Year and Origination Year, CreitScoreRangeUpper and CreditScoreRangeLower is all because they are more or less the same variables. The highest correlation in this matrix, is between the Current Delinquencies and Delinquencies Last 7 years which is 0.4 which means that those accounts are still open. One more surprising thing I noticed was that the BorrowerRate and CreditScore have a negative correlation of 0.5 and ProsperScore w.r.t LenderYield has a negative correlation of 0.6, this is unexpected since the score usually determines the risk factor which indirectly determines the rate of interest at which the loan is issued.

## Warning: Removed 3525 rows containing missing values (geom_point).

LenderYield w.r.t DebtIncomeRatio

## Warning: Removed 9090 rows containing missing values (geom_point).

The above two plots of DebttoIncomeRatio vs. Credit Score and LederYield respectively. The first plot shows that there is overplotting near the lower end of DebttoIncomeRatio, the defaulters tend to have low credit score. There is a line of plots at the higher end of DebttoIncome ratio which are mostly the ones who have completely paid back their loans. The second plot with Lender Yield, also has overplotting at the lower end of Debt-Income ratio. Most of the low risk prosper grades have a very low interest rate compared to the ones at higher risk.


FINAL PLOTS AND ANALYSIS

Plot One : Time series Plots

Time series plot help us to better visualize the performance of Prosper, we can see the different trend lines, ups and downs of the company.I’m creating a separate dataset from our main dataframe Loan, with the Quarter variable and the original loan amount.

Loans per Quarter since the start of Prosper Marketplace

We see that the quarter4 in the year 2008, the number of loans taken gradually decreases and there are no loans issued in Quarter 1 of the year 2009, Q2-2009 there are only 10 loans issued and then it slowly increases. This pitfall is because of the Great Depression of 2008. The last quarter of 2013, Propser issued the maximum number of loans of 144500, followed by the year 2014, we have only the first quarter information and during that time, it seems like Prosper issued 121700 loans. This plot allows us to see how Prosper was performing since it has started, though we have the information only until its first quarter of 2014

Plot 2

Status of the borrowers under each Category

I believe it doesn’t give us enough details when using listing category variable for trend or relationship analysis and the category is probably better left ignored during those types of investigations. But one thing we can take away from the high volume of borrowers requesting debt consolidation loans, is that, many borrowers have current debt and ver few percent have defaulted and is by far the most frequent category.

Plot Three

## Warning: Removed 272 rows containing missing values (geom_point).

Loan Amount under each risk category over the loan period

I believe from this plot, its clear that most of the low risk loans have been issued for longer period from 36 to 60 months. The loans that were issued for a year or less is more in the higher risk borrowers.

Reflection

In this analysis, my major hurdle was reading the entire csv due to its huge volume of 114k observations and 81 variables. My first thought to reduce the runtime was to subset the dataset, with my initial understanding of the variables from the Variable Description file, I started off with 15 variables. But as i progressed into my analysis, I realized that these 15 were not enough to get the results I was looking for. This re-iteration of going back to step 1 and re-selecting the variables went on for 3-4 times!Second biggest hurdle, was trying to figure out how to get the ratings for the loans issued before 2009.Since we had a good amount of data from 2007 quarter-1. I finally figured out a way to club the ProperRating..Alpha with the Credit Grade provided, to create a new column called Credit Rating. I also wanted to use the borrower state variable to generate a R-Map to see the different counts of defaulters, loanStatus per state. But do to some discrepency with the Mappers library or due to my mistake, I was unable to do so!

There are a number of different ways to take this project further. Firstly, I’ve focused only on a small subset of the variables available in the dataset, and there is a vast amount of data I’vent explored. I think if I better understood the finance/loan terms I would do better justice to this data analysis.I think I’d like to explore the investors side a bit more; look at investor profit and losses. Also, I would like to learn about the kind of plots and graphs specifically used by the finance industry, so that I can incorporate that knowledge into any future datasets I may explore, like using density line graphs showing spikes by time period.